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Abstract 

Scaffolding is an important subproblem in de novo genome assembly in which mate pair data 
are used to construct a linear sequence of contigs separated by gaps. Ifere we present SLIQ, 
a set of simple linear inequalities derived from the geometry of contigs on the line that can be 
used to predict the relative positions and orientations of contigs from individual mate pair reads 
and thus produce a contig digraph. The SLIQ inequalities can also filter out unreliable mate 
pairs and can be used as a preprocessing step for any scaffolding algorithm. We tested the SLIQ 
inequalities on five real data sets ranging in complexity from simple bacterial genomes to complex 
mammalian genomes and compared the results to the majority voting procedure used by many 
other scaffolding algorithms. SLIQ predicted the relative positions and orientations of the contigs 
with high accuracy in all cases and gave more accurate position predictions than majority voting 
for complex genomes, in particular the human genome. Finally, we present a simple scaffolding 
algorithm that produces linear scaffolds given a contig digraph. We show that our algorithm is very 
efficient compared to other scaffolding algorithms while maintaining high accuracy in predicting 
both contig positions and orientations for real data sets. 

1 Introduction 

De novo genome assembly is a classical problem in bioinformatics in which short DNA sequence 
reads are assembled into longer blocks of contiguous sequence (contigs) which are then arranged into 
linear chains of contigs separated by gaps (scaffolds). Modern genome sequencing projects typically 
include mate pair reads in which the approximate distance between a pair of reads plus the two read 
lengths (the insert length) is fixed during the experimental construction of the sequencing library. 
Some genome projects also include mate pair libraries with several different insert lengths. Although 
there are experimental differences between mate pairs and paired-end reads, we will refer to them 
interchangeably as mate pairs in this paper since we can treat them identically from an algorithmic 
point of view. 

Computational genome assembly is typically performed in at least two stages — the contig building 
stage and the scaffolding stage. In this paper we do not address the contig building problem but rather 
assume that we have access to a set of contigs produced by an independent algorithm. However we 
discuss the relationship of the contig building and scaffolding stages later in the Discussion. For 
the scaffolding problem, the most popular strategy is to construct the contig graph in which nodes 
represent contigs and edges represent sets of mate pairs connecting two contigs (i.e. the two reads of 
the mate pair fall in the two different contigs). The edges are given weights equal to the number of 
mate pairs connecting the two contigs. 

A common procedure is to filter out unreliable edges by picking a small threshold (commonly 2-5) 
and removing all edges with weight less than that threshold. For the remaining edges, a majority 
vote is used to decide on the relative orientation and position of the contigs. This simple majority 
voting strategy is implemented in a number of commonly-used assemblers and stand-alone scaffolders 
including ARACHNE [J, B AMBUS [TS;, SOPRA [5^ and SOAPdenovo [H] with various choices of 
threshold. Opera [7] and the Greedy Path-Merging algorithm [TU] use a different strategy to bundle 
edges. Given a set of mate pairs connecting two contigs, these algorithms calculate the median and 
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Figure 1 : The geometry of two contigs, Ci and Cj , arranged on a line with relevant quantities indicated. 
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standard deviation of the insert lengths of the set of mate pairs and create a bundle using only mate 
pairs with insert length that are close to the median. ALLPATHS [3] and VELVET jl^ do not 
build the contig graph and thus do not have a read filtering step similar to the other assemblers 
mentioned. The majority voting procedure implicitly assumes that misleading mate pairs are random 
and independently generated and that majority voting should eliminate the problematic mate pairs. 
However, this assumption is often not true because of the complex repeat structure of large genomes, 
such as human. 

In this paper, we show that unreliable mate pairs can be reliably filtered using SLIQ, a set of 
simple linear inequalities derived from the geometry of contigs on the line. Thus SLIQ produces a 
reduced subset of reliable mate pairs and thus a sparser graph which results in a simpler optimization 
problem for the scaffolding algorithm. More importantly, SLIQ can be used to predict the relative 
positions and orientations of the contigs, yielding a directed contig graph. Our experiments show 
that both SLIQ and majority voting are very accurate at predicting relative orientations but SLIQ is 
clearly more accurate at predicting relative positions for complex genomes. 

The simplicity of SLIQ makes it very easy to integrate as a preprocessing step to any existing 
scaffolders, including recent scaffolders such as MIP scaffolder [TB], Bambus 2 [T2] and SSPACE [5]. 
To illustrate the effectiveness of SLIQ, we implemented a naive scaffolding algorithm that produces 
linear scaffolds from the contig digraph. We show that despite its simplicity, our naive scaffolder 
provides very accurate draft scaffolds, comparable to or improving upon the more complicated sate 
of the art, very quickly. These scaffolds can either be output directly or used as reasonable starting 
points for further refinement with more complex scaffolding algorithms. 

2 Algorithms 

We begin with a high level outline of our algorithm for constructing a directed contig graph (Algorithm 
[T]). The crux of the algorithm is SLIQ, a set of simple linear inequalities that are used to filter mate 
pairs and predict the relative position and orientation of contigs. In subsequent sections, we will 
present proofs for the SLIQ inequalities and a detailed version of the digraph construction algorithm 
(Algorithm [2]) . Finally, we will present a simple scaffolding algorithm (Algorithm [3|) that uses the 
contig digraph to construct draft scaffolds. Throughout the paper we will abbreviate mate pair reads 
as MPR. 

2.1 Definitions and Assumptions 

For the sake of deriving the SLIQ inequalities, we assume that we know the position of the contigs on 
the reference genome. However, this information cancels out later on which allows us to analyze the 
MPRs without access to prior contig position information. For the derivation we also assume that all 
the contigs have the same orientation. Later we will not need this information. 

Let Pi be the position of contig Ci in the genome and U be the length of the contig (Fig. [1]) . We 
define gap gij to be the difference between the start position of contig Cj and the end position of 
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Algorithm 1 Construct Contig Digraph (Outline) 



Require: input: P = a set of MPRs that connect two contigs, C = a set of contigs 
1: Construct the contig graph G with vertex set C and edges representing MPRs from P that pass 

a certain majority cutoff. 
2: Find a good orientation assignment for the contigs {& — {Oi, O2, . . .}) where Qi is the orientation 

of the ith contig, for example by finding a spanning tree of G. 
3: Define Mp to be the set of MPRs that satisfy the SLIQ inequalities 

4: Construct a directed contig graph Gd with vertex set C and edges representing MPRs from Mp 
that pass certain criteria. 



contig Gi, and similarly for gjii 

9^0=P,-P^~h, (1) 

gji ^ Pi — Pj — Ij ■ 

We assume that the maximum overlap of two contigs is one read length, R. In practical contig building 
software based on De Bruijn graphs, the maximum overlap is usually one fc-mer where i? > fc so our 
assumption is valid. 

2.2 Derivation of Two Gap Equations 

If we assume that Pi < Pj as in Fig. [T]and that the maximum overlap between two contigs is R (i.e. 
the minimum gap gij is —R), then 

Pj -P^-h> -R, 

Pj -P^>k-R. (2) 

Now consider the quantity gij — gji . Using ([T]) , we can derive the following inequality which we call 
Gap Equation 1 

5y - g^, = 2 (Pj -Pi) + {I J - h) 

> 2k ~2R + Ij - 

>k + Ij - 2R. (3) 

Therefore, we have shown that {Pi < Pj) ^ {gij — gji > h + Ij — 2R). Next consider the quantity 
gij + gji. We can easily derive Gap Equation 2: 

gij + gji ^ -{Ij +h)- (4) 

Now we will prove the other direction of the implication in Gap Equation 1 and show that {gij — 
gji > k + Ij — 2R) {Pi < Pj). Using Gap Equation 1 and Equation ([T]) , we get 

9ij ~ 9ji > li + Ij - 2i?, 
2(P,- - P,) + {Ij - U) >k + Ij - 2R, 
2{Pj - Pi) > 2k - 2R, 



Pj -P^>k-R■ 



(5) 
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Figure 2: The geometry of two contigs arranged on a line in terms of quantities known in de novo 
assembly. 
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Since no contig length can be less than R, the length of a read, k — R > and hence, Pj — > or 
Pi < Pj. Therefore, (g^ — gji > h + Ij — 2R) => (P,; < P,) and together we have proven, 

{g^J ~ 9n >k + 1 3 - 2P) ^ (Pi < P, ). (6) 

2.3 Using the Gap Equations to Predict Relative Positions 

Our definitions in Equation ([T]) used the quantities Pi and Pj which are not available in practice in 
de novo assembly. Thus we need to define the gaps gij and gji in terms of quantities we know such as 
the insert length L and the read offsets relative to the contigs Oi and oj. Note that the insert length 
for each MPR is an unknown constant so treating it as a constant in the proof is justified. In practice, 
we use L = L + 2a, where L is the reported or computed mean and a is the standard deviation of the 
insert length distribution. 

Let L be the insert length, and oj be the offsets of the start positions of the paired reads in 
Ci and Cj respectively and Qi and Oj be the orientations of Cj and Cj respectively. To simplify the 
notation we abbreviate Qi = Qj as Oi=j and 8^ Qj as Qi^j- Then, if Pi < Pj and Qi=j (see Fig. 
[2]), we can redefine the gaps gij and gji without using the contig start positions Pi and Pj: 

gij = L - li + Oi - Oj ~ R, (7) 
gji — -L - Ij + Oj + R ~ Of (8) 

Note that these definitions remain consistent with Gap Equation 2 (Equation Taking the differ- 
ence of Equations ([7]) and ([5]) we can similarly remove Pi and Pj from Gap Equation 1 : 

g,j - gj, = 2L - 2P + 2{o, - o,) + {Ij -k). (9) 
Using Equations ([9|) and (O, we derive the following inequality: 

2L-2R + 2{oi - Oj) + {Ij - li) > li + - 2R, 
2L + 2{oi - Oj) + {Ij - li) >li + Ij, 
L + (oj - Oj) > li. 

Consequently we obtain that {Pi < Pj) A 01= j L + {oi — Oj) > U. Negating the implication gives 

-(L + (o, - 0,) > h) ^ -((P, < P,) A e,=j), 
L+{oi~ Oj) < k {Pi > Pj) V e^^j. 
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Now without loss of generality we can assume that Oi^j is false. This is possible because our exper- 
iments later show that the SLIQ or majority voting procedures are both very accurate at predicting 
relative orientation (Table [5]) so we can first determine the relative orientations of the contigs and flip 
the orientation of one contig if required. Thus we have 

L + {o^-o,) <l,^ {P,> Pj). (10) 

In addition, we introduce two filters that are very useful in practice for removing unreliable MPRs. 
To derive the first filter, if Pj < Pj, 

L ~ Ij — Oj + Qji + Oi + R, 
> h -Oj - R + Ot+R, 
Oj — Oi > Ij — L, 

Oi - Oj < -Ij + L. (11) 
The second filter is to discard an MPR if it passes the test for both P,; < Pj and Pj < Pi. 



2.4 Using the Gap Equations to Predict Relative Orientations 

So far we have only predicted relative positions when Qi=j- Now we show that we can also use the 
gap equations to infer the relative orientations of the contigs. First, if {Pi < Pj) and the minimum 
gap is —R then we have 

g^^ ^ L ~ k + o^ - Oj - R> -R. (12) 
Similarly, if (Pj < Pi), then we define gji and write 

gji — L — Ij + Oj ~ Oi — R > —R. (13) 

Note that gji is different than gji which we defined under the assumption Pi < Pj in Equation ([8]). 

Since {Pi < Pj) and {Pj < Pi) are mutually exclusive and exhaustive neglecting Pi = Pj, at least 
one of Equations (fT2)) and (fT3)) will be true. Note that possibly also both could be true. For example, 
if Pi < Pj then gij > —R. Now {Pj < Pi) must be false, but that does not imply that gji > — i? 
is false. If both Equations and (I13|) are true, then we can add them to get 2L > li + Ij. To 
summarize, 

{{grj > -R) A {fjr, > -R)) ^2L> k + h, 

2L<k + Ij {^{g,j > -R) V ^{gj, > -R)) 

Recalling again that at least one of Equations p2| and ([13]) are true, we see that 2L < li + Ij is a 
sufficient condition for mutual exclusion (the XOR relation is denoted by ©): 



Qi=j A {2L <li + Ij) =^ {gij > -R) (B {gj^ > -R), 
-((5»j > ~R) ® {9J^ > -Ft)) -(0^=J A (2L < k + Ij)), 

--{{g^J > ~R) ® {gJ^ > -R)) (e.^j v {2L >k + 1,)). 

(14) 

If we use this equation only when the MPR and contigs satisfy the inequality 2L < li + Ij, we can 
then make the relative orientation prediction 

^{{g^J > -R) © {g3^ > -R)) ^ (15) 
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Intuitively, the condition 2L < li + Ij means that the contig lengths should be large relative to the 
insert length in order for the SLIQ method to work. To find contigs of the same orientation, we 
arbitrarily flip one contig and run the above tests again, only this time if Equation (1151) holds, then 
we conclude that the contigs were actually of the same orientation. Say we flip C;. We call the new 
offset or-. Then 

-((ffT, > -R) © % > ~R)) ^ Bj^, => 

Again, we introduce two additional filters that are very useful in practical applications. First, if 
we find an MPR that predicts both Qi^j and Qi^j then we leave it out of consideration. Second, 
if the SLIQ equations imply Qi^j , then we require that both the reads of the MPR have the same 
mapping directions on the contigs and similarly for Qi=j- 

We summarize our results in the following lemmas and Algorithm [2] 

Lemma 1 // the maximum overlap between contigs is R and 2L < li + Ij , then 

A{9^] > -R) ® fei > -R)) Qiyij, 

-((ffTj > -R) ® % > -R)) ^ - 

Lemma 2 If the maximum overlap between contigs is R, the contigs have the same orientation, (i.e. 
then 

{L + {o,-Oj)<k)^{P,>Pj). 

We also summarize the SLIQ inequalities, 

9 1] - 9]i >k-lj- 2i?, 
9i] +9]i = -{Ij + U), 
{9i3 - 9j^ >h + Ij - 2R) {Pi <Pj), 

g,j - g.ji = 2L - 2i? + 2(o, - Oj ) + {Ij - k). 



2.5 Illustrative Cases and Examples from Real Data 



Figure 3: Illustrative cases in which both reads of the MPR fall in the center of the contigs (left) and 
the contigs have reversed positions (right). 
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In this section we present two illustrative cases that provide the intuition underlying the SLIQ 
equations. The ideal case for an MPR connecting two contigs is illustrated in Fig[T] In that case the 
contigs are long compared to the insert length and the reads are mapped to the ends of the contigs. 
However, this situation does not always occur. Suppose the contigs are short such that the two reads 
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Algorithm 2 Construct Contig Digraph 

Require: input: M = a. set of MPRs connecting contigs, C = a set of contigs, w =cutoff weight 
1: Define E' = {{Ci,Cj) : an MPR connects Q and Cj} 

2: Let wt{i,j) = (number of MPRs suggesting that Cj and Cj have the same orientation) — (number 

of MPRs suggesting that Ci and Cj have different orientations) 
3: E = {{C\,Cj) : G E'Awt{i,j) > w} 

4: Construct a contig graph G with vertex set C and edge set E. 

5: Find a good orientation assignment (© = {©i, O2, . . .}) for the contigs, for example, by finding a 

spanning tree of G. 
6: Set Mp = {} 
7: for all p : p € M do 

8: Let Cj and Gj be the contigs connected by p. 
9: if Oj=j then 

10: if (L + (oj - Oj) < li) AND (oj - Oj < -k + L) then 
11: predict Pj > Pj 

12: Mp = Mpli {p) 

13: end if 

14: if (L + {oj -Oi) <lj) AND {oj -Oi<-lj+ L) then 
15: predict Pj < Pj 

16: Mp = Mp U {p) 

17: end if 
18: end if 

19: end for 

20: Let \E{i,j)\ be the number of MPRs from Mp that predict that P^ < Pj 

21: Define E^ = {(Cj,Cj) : \E{i,j)\ > \E{j,i)\} 

22: Output a contig digraph Gd with vertex set C and edge set E4. 
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of an MPR fall exactly in the center of the contigs. Then the right hand side of Equation (jH]) reduces 
to 2L — 2R. So for both cases Pi < Pj and Pj < Pi, the right hand side of Equation © has the 
same value, making it impossible to predict the relative positions of the two contigs. This situation 
is illustrated in Fig. |3]on the left. It is easy to see that prediction becomes easier as the contigs get 
longer and the reads move away from the center of the contigs. 

Now assume that the working assumption is Pi < Pj but in reality, the reverse {Pj < Pi) is true. 
Then given that the contigs are long and reads map to the edges of the contigs, the insert length 
L would suggest the scenario depicted in Fig. [3] (right side). This would make both gij and gji (as 
calculated from Equations ([7]) and ([S])) smaller than they should be. In reality, the position of the 
contigs is similar to that shown in Fig. [T] where we see that both gij and gji are larger than in Fig. [3] 
(right side). These wrong values would then be too small to satisfy the left hand side of Equation ^ 
and this would demonstrate that the working assumption of Pi < Pj is wrong. 



Figure 4: Three real examples of SLIQ predictions from the PSY dataset. For the correct prediction 
the equation L + (oi ~ Oj) < U evaluates to 3385 < 5043. In the wrong prediction, it should have 
satisfied L + [oj ~ Oi) < Ij but one of the contigs is smaller than the insert length so it evaluates to 
262 < 217 (false). However L + (oi — Oj) < li evaluates to 498 < 863 so the wrong prediction is made. 
In the no prediction case, the condition Oi — Oj < —Ij +L is violated. Even if that did not fail, since one 
of the offsets falls almost in the center of a contig, both the conditions L + (oj — Oi) < Ij, (299 < 1384) 
and i + (oi — Oj) < k, (461 < 506) are satisfied and we would not give a prediction for this MPR. To 
simplify the calculations we used L = 380. 



Correct Prediction Wrong Prediction 
o, = 3051[ 1 Oj = 46 Oj = 99 1 \o^ = 217 



h = 3149 Ij = 5049 Ij = 217 k = 863 



No Prediction 



o, = 275| 1 Oj = 194 



h = 506 Ij = 1384 



It is also instructive to consider examples from real data. We show three cases from a real data 
set: one in which SLIQ made a correct prediction, one in which SLIQ made a wrong prediction and 
one where SLIQ did not make any predictions (Fig. |4]). We explain precisely which inequalities are 
violated in the figure caption. The real examples show the difficulties of making SLIQ predictions 
when the reads fall close to the center of a contig or when the contig lengths are small relative to the 
insert size. 

2.6 Naive Scaffolding Algorithm 

The contig digraph constructed in Algorithm [2] can be directly processed to build linear scaffolds. To 
illustrate this point, here we present a naive scaffolding algorithm (Algorithm [Sj . 

To analyze the computational complexity of the naive scaffolding algorithm, let N be the number 
of MPRs in the library. Constructing G takes 0{N) time. Finding articulation points takes 0{n + m) 
time where n = \V\ and m = \E\ 9 . If we have a articulation nodes, then finding junctions takes 0(an) 
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Algorithm 3 Naive Scaffolder 
1: G(y,E) ^Construct Contig Digraph (Algorithm [5]) 

2: Identify and remove junctions from G. Junctions are defined as articulation nodes with degree > 
3 that connect at least 3 subgraphs of G of size larger than some given threshold. The size of a 
subgraph is defined as the sum of all contig sizes in that subgraph. 

3: Identify all simple cycles in G and remove the edge with the lowest weight from each simple cycle. 

4: If G still contains strongly connected components, those components are removed. G is now a 

directed acyclic graph. 
5: Output each weakly connected component of G as a separate scaffold. 

6: The order of contigs in each scaffold is computed by taking the topological ordering of the nodes 
of their respective weakly connected component in G. 



time. Identifying and breaking simple cycles takes 0{{n + m)(c + 1)) time where c is the number of 
simple cycles [11 . Finally, topological sorting takes 0{n+m) time. In total the complexity of the naive 
scaffolding algorithm is 0{N)+0{n+m)+0{a7i)+0{{n+m){c+l)) = 0{N)+0{an)+0{{n+m){c+l)). 
In practical data sets, a and c are small constants and N » n,m. Thus for practical purposes the 
time complexity of the algorithm is 0{N). 

3 Experimental Results 



Table 1: Descriptive statistics about the datasets. R is the read length, cov is the coverage, L is 
the reported insert length, is the real insert length calculated by mapping reads to the reference 
genome and cr is the standard deviation of Lr- 



Set ID 


Organism 


Size. 


Ref. Genome 


Read Lib 


R 


cov 


L 




a 


PSU 


P. suwonensis 


3.42 Mb 


CP002446.1 


SRR097515 


76 


870x 


300 


188.78 


18.77 


PSY 


P. syringae 


6.10 Mb 


NC.007005.1 




36 


40x 


350 


384.11 


67.13 


SY-CE 


C. elegans 


100.26 Mb 


NC.003279-85 


SRR006878 


35 


38x 


200 


232.13 


54.44 


PST 


P.stipitis 


15.40 Mb 


4^ 




75 


25x 


3.2K 


3.27K 


241.50 


DS 


D.simulans 


109.69 Mb 


NT.167066.1- 
68.1, 

NT.167061.1, 
NC.011088.1- 
89.1, 

NC.005781.1 


SRR121548, 
SRR121549 


36 


62x 


N/A 


187.99 


61.47 


SY-HS 


H. Sapiens 


3.30 Gb 


NCBI36/ hglS 


ERA015743 


100 


45x 


300 


310.63 


20.74 


HS 


H. Sapiens 


3.30 Gb 


NCBI36/ hgl9 


ERA015743 


100 


45x 


300 


310.63 


20.74 



To demonstrate the performance of our algorithms in practice, we ran them on five real data sets 
and two synthetic data sets. The data sets represent genomes ranging in size from small bacterial 
genomes (3Mb) to large animal genomes (3.3Gb) (see Table [T] for details). 

For each data set, we obtained a publicly available mate pair library. We used publicly available 
pre-built contigs for the Drosophila simulans (DS) and human (HS) [8] data sets. Pre-built contigs 
were not available for the three microbial data sets — P. suwonensis (PSU), P. syringae (PSY) and P. 
stipitis (PST) — so we used the short read assembler VELVET [T7] to construct contigs. All software 
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Table 2: Summary of the results of SLIQ vs. majority filtering for contig graph edges of five real 
datasets. Here, n is the total number of edges comiecting two different contigs, We is the minimum 
wieght of an edge for SLIQ prediction, Uq is the number of edges for which we can predict relative 
orientation, Co is the accuracy of relative orientation prediction. Hp is the number of edges for which we 
can predict relative position, Cp is the accuracy of relative position prediction and w,n is the minimum 
weight of an edge for majority prediction. The same notations is used for majority filtering except 
with prime. 



Set ID 


n 


We 




Go 


Hp 


ep 




< 




n'p 


4 


PSU 


U54: 


2 


2507 


99.69% 


3803 


99.21% 


4 


3942 


99.59% 


3925 


94.87% 


PSY 


2086 


2 


1628 


98.40% 


1852 


95.62% 


4 


2019 


98.56% 


1990 


98.59% 


PST 


2291 


1 


1233 


75.18% 


1516 


87.33% 


2 


1365 


97.87% 


1336 


16.54% 


DS 


8738 


1 


6305 


92.18% 


7097 


80.55% 


2 


6390 


91.87% 


5861 


77.25% 


HS 


36346 


1 


31799 


79.56% 


31153 


89.71% 


2 


32676 


79.14% 


25750 


75.62% 



parameters and sources for the data are provided in Table For the two synthetic datasets, C. 
elegans (SY_CE) and human (SY-HS), we constructed contigs by mapping reads back to the reference 
genome and declaring high coverage regions to be contigs. We will discuss the performance of the 
algorithms on the synthetic data sets at greater length in the Discussion. We mapped the reads to 
the contigs using the program Bowtie (v. 0.12.7) [13]. Below we only report results for the uniquely 
mapped reads because we know the ground truth for them. 

3.1 Comparison of SLIQ and Majority Voting Predictions 

On all the real data sets, SLIQ was highly accurate in predicting both relative orientation (> 75%) 
and position (> 80%) (Table [2|). For orientation prediction, SLIQ and majority filtering produced 
almost identical accuracies except for the case of P. stipitis (PST) where SLIQ had lower accuracy. 
One possible reason might be that the PST library used long mate pair reads which may be more 
inaccurate than the other libraries we tested. Conversely, for PST, majority voting gave far worse 
accuracy (16.5%) than SLIQ (75%) in relative position prediction, confirming that this data set is an 
outlier. 

Focusing only on the position predictions, SLIQ showed a significant advantage in both the number 
and accuracy of the predictions compared to majority voting for the more complex genomes — D. 
simulans and human (Fig. [5]) . Importantly, the improvement was particularly large for the human 
genome. 

Finally, Table |3] gives a more detailed comparison of cases where the SLIQ and majority voting 
predictions disagreed. When the two methods disagreed, SLIQ clearly outperformed majority voting 
procedure. For example, for human, when the methods disagreed, SLIQ was right in 1852 cases and 
majority voting in only 165 cases. SLIQ was also generally more accurate when considering only the 
predictions made uniquely by each method, except in one case (PSY). 



3.2 Computing the Optimal Insert Length 

In our experiments, we found that using a slightly larger value for L than that reported or estimated 
increased both Up, the number of MPRs for which we could make a relative position prediction, and 
Cp, the accuracy of relative position prediction. This may seem surprising at first given Equation 
(|10|) . However, for Hp it can be seen from Fig. [T] that underestimating L would reduce gij which 
would lead to more overlaps between contigs. Since we assume that the maximum contig overlap is 
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Table 3: Comparison of position predictions between the SLIQ and majority voting methods. Here, 
ria is the number of predictions where the methods agreed, Ud is the number of predictions where the 
methods disagreed, Ud^ is the number of predictions not in agreement where SLIQ was correct, Ud^ 
is the number of predictions not in agreement where majority voting was correct, n'^ is the number 
of predictions made only by SLIQ, Cq is the accuracy of predictions made only by SLIQ, is the 
number of predictions made only by majority voting, is the accuracy of predictions made only by 
majority voting. 



Set ID 




nd 






< 


eq 






PSU 


3089 


646 


643 


3 


68 


95.58% 


190 


90.52% 


PSY 


1519 


287 


235 


52 


46 


86.95% 


184 


96.19% 


EST 


290 


794 


784 


10 


432 


58.56% 


252 


25.00% 


DS 


2447 


820 


804 


16 


409 


93.15% 


2035 


76.41% 


HS 


16425 


2017 


1852 


165 


12711 


85.67% 


7308 


52.73% 



Table 4: Parameter values used in the analysis of all datasets. v is the number of mismatches allowed 
in read mapping (Bowtie v. 0.12. 7). 



Data Set 


V 


contig construction 


contig mapping 


PSU 


2 


(velvet) Hash lcngth=21, cov_cutoff=5, 
min_contig Jgth= 150 


(vmatch) min match length I = 150, 
Hamming distance h = 


PSY 





(velvet) Hash length=21, cov_cutoff=5, 
min_contig Jgth= 150 


(vmatch) min match length I = 150, 
Hamming distance h = 


PST 





(velvet) Hash length=35, cov_cutoff= 
auto, min_contigJgth=100 


(vmatch) min match length I = 200, 
Hamming distance h = 5 


SY-CE 


1 


(synthetic) cov cutoff=5, min contig 

lcn=i 


available from synthetic construction 


DS 


2 


accession number AASROIOOOOOI- 
AASR01050477 


(vmatch) min match length I = 200, 
Hamming distance h = 5 


SY-HS 


2 


(synthetic) cov cutoff=3, min contig 
len=2i? 


available from synthetic construction 


HS 


3 


accession number 
AEKP01000001:AEKP01231194 


(vmatch) min match length I = 300, 
Hamming distance h = 
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Figure 5: Comparison of the accuracy of SLIQ and majority voting for relative position prediction 
using that same data shown in Table [5] 



40000 



35000 



30000 



g 25000 
o 



1 20000 

O 

^ 15000 



10000 



5000 



H Correct 

1 Wrong 

H No Prediction 



PSU 



PSY 



HSR 



DS 



PST 



maj maj maj maj maj 



R, underestimating L would remove many MPRs from the predictions. However, at the moment we 
do not have an explanation for the observed increase in ep, the prediction accuracy. 

On the other hand, using a slightly smaller value for L increased Uq, the number of MPRs for which 
we could make a relative orientation prediction, while Co, the prediction accuracy for orientation, 
remained constant. We suspect that a lower L makes Equation (fT2)l and (fT3)l harder to pass and thus 
less MPRs are excluded by the mutual exclusion test. 

3.3 Computing the Rank of MPRs 

Our experimental results also agree with our illustrative cases (section 12. 5p in that the prediction ac- 
curacy decreases as 2{oi — Oj) gets closer to {k — Ij) which intuitively means that the reads are falling 
closer to the center of the contigs. To address this issue we can rank the MPRs by the minimum value 
of c for which they fail to pass the more stringent inequality \2{oi — Oj) — {U — > cR. We say that 
an MPR has rank c if and only if c is the smallest positive integer such that \2{oi — Oj)~ [U — lj)\ < cR 
and MPRs with higher rank are considered more confident with regards to their prediction. Fig. [6] 
shows how the prediction accuracy depends on the rank of the MPRs in the PSY dataset. 
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Figure 6: Change in the prediction accuracy, Cp, as we restrict our analysis to MPRs of higher rank 
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3.4 Performance of the Naive Scaffolder 

We summarize the resuhs of our naive scaffolder on the five real data sets in Table [5] and Table IHl 
For all data sets, the orientation accuracy was very high (> 97%) and the position accuracy was also 
high (> 89%). While the genome coverages of PSU and DS may appear surprising, note that the 
PSU library had a very high coverage while the DS library had low coverage and was also made up 
of a number of different D. simulans strains. It is likely that the PSU contigs include misassembled 
fragments in the contigs, making the total length of the contigs larger than the genome size. For DS, 
the combination of low coverage and relatively high rates of sequence differences between the different 
D. simulans strains likely resulted in lower genome coverage. 



Table 5: Summary of the results of our naive scaffolder on real data. N50 is the length n such that 50% 
of bases are in a scaffold of length at least n. The position accuracy measures how many neighboring 
contigs in the scaffold were placed in the correct order. 



Data Set 


N50 


Genome 
Coverage 


Orientation 
Accuracy 


Position 
Accuracy 


PSU 


17K 


116.1% 


99.64% 


97.95% 


PSY 


75K 


90.98% 


98.26% 


93.42% 


PST 


215K 


97.89% 


98.90% 


89.89% 


DS 


942 


59.48% 


97.52% 


96.07% 


HS 


18k 


79.27% 


98.28% 


98.03% 
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Table 6: Run time comparison of our Naive Scaffolder with two other state-of-the-art scaffolders, 
SOPRA and MIP Scaffolder. All times are the sum of the user and system times reported by the 
Linux time command. We ran all software on a 48 core Linux server with 256GB of memory. [NOTE 
to reviewers: MIPS has been running for more than 1500 minutes and we will insert exact running 
times in the final manuscript] 



Data Set 


Naive Scaffolder 


SOPRA 


MIP Scaffolder 


PSU 


6ni40.39s 


237m27.237s 


> 1200m 


PSY 


59.36s 


44m57.604s 


> 1200m 


PST 


67.21s 


3009m29.224s 


> 1200m 


DS 


7m7.449s 


N/A 


> 1200m 


HS 


241m33.928s 


N/A 


> 1200m 



4 Discussion 

In conclusion, we have presented a mathematical approach and an algorithm for constructing a contig 
digraph that encodes the relative positions of contigs based on mate pair read data. Our main insight 
is the derivation of a set of simple linear inequalities derived from the geometry of contigs on the line 
that we call SLIQ. We can use SLIQ both to efficiently filter out unreliable mate-pair reads (MPR) and 
predict the relative positions and orientations between contigs. We have shown that SLIQ outperforms 
the commonly used majority voting procedure for the prediction of relative position of contigs while 
both methods are very accurate for orientation prediction. The contig digraph can also be directly 
processed into a set of linear scaffolds and we have presented a simple scaffolding algorithm for doing 
so. Our naive scaffolder has high accuracy on all data sets tested and is very efficient — for practical 
purposes, as it takes time linear in the size of the mate pair library and it is also very fast compared 
to other state-of the art scaffolders. The output of our naive scaffolder can either be used directly as 
draft scaffolds or used as a reasonable starting point for refinement with more complex optimization 
procedures used in other scaffolders. 

One interesting and unexpected finding of our experiments was that the simple majority voting 
procedure performs very well for predicting the relative positions of contigs if the contigs have few 
errors. This can be seen by the performance of the majority voting procedure when using synthetic 
contigs that are not constructed using de novo assembly tools but rather by mapping the reads back to 
a reference genome and identifying regions of high coverage which is expected to produce much higher 
quality contigs (Table [7]) . This observation suggests a novel way to approach the scaffolding problem 
in which the contig builder would output smaller but higher quality contigs and allow the scaffolder to 
handle the remainder of the assembly. We believe this is a significant change in philosophy of genome 
assembly programs to date in which during the contig building step, one generally attempts greedily 
to build contigs that are as long as possible. This view point also differs considerably from previous 
approaches to scaffolding in which the focus was on resolving conflicts between mate pairs that gave 
conflicting information about the relative orientation and position of contigs. 

Finally, we are exploring several possible extensions of the SLIQ method. The first extension is 
to find the optimal value for L, the insert length, so that we optimize the number and accuracy of 
relative position and orientation predictions. The second extension is to assign numerical values to 
the accuracy of prediction of MPRs of a particular rank. Finally, for the multiply mapped MPRs 
which were not included in the results, we plan to identify the most likely mapping for the MPR, for 
example by using their ranks. 
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Table 7: Summary of the results of majority prediction for synthetic datasets for C. elegans (SY_CE) 
and humans (SY_HS). n is the total number of edges connecting two different contigs, Wm is the 
minimum weight of an edge for majority prediction, Uf, is the number of edges for which we can 
predict relative orientation, Cg is the accuracy in relative orientation prediction, rip is the number of 
edges for which we can predict relative position and Cp is the accuracy in relative position prediction 



Data Set 


n 


W„i 




Co 


Up 




SY-CE 


17620 


3 


17620 


99.52% 


17532 


99.85% 


SY-HS 


878380 


3 


878380 


98.93% 


868877 


99.47% 
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